13 research outputs found

    Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis

    <p>Abstract</p> <p>Background</p> <p>One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself.</p> <p>Results</p> <p>We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. 
As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets.</p> <p>Conclusion</p> <p>This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at <url>http://www.abcc.ncifcrf.gov/wps/wps_index.php</url></p>
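The pathway-level comparison described in this abstract can be illustrated with a minimal sketch. It is not the PPEP or WPS implementation: the abstract does not name its enrichment statistic, so the one-sided hypergeometric test used here is an assumption, and the function names, gene identifiers, and pathway names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(hits, list_size, pathway_size, background_size):
    """P(X >= hits) when list_size genes are drawn from a background of
    background_size genes, pathway_size of which belong to the pathway."""
    upper = min(list_size, pathway_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


def enrichment_pattern(gene_lists, pathways, background_size):
    """Return {pathway: {list_name: p_value}}, one row per pathway, so that
    enrichment can be compared across gene lists at the pathway level."""
    pattern = {}
    for pathway_name, members in pathways.items():
        row = {}
        for list_name, genes in gene_lists.items():
            hits = len(genes & members)  # genes shared by the list and the pathway
            row[list_name] = hypergeom_pvalue(
                hits, len(genes), len(members), background_size
            )
        pattern[pathway_name] = row
    return pattern
```

Each row of the returned mapping is a pathway's enrichment pattern: two lists that share few genes can still show low p-values for the same pathway, which is the kind of commonality a gene-level intersection would miss.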

    Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data

    <div><p>As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.</p></div>

    Growth of articles in MEDLINE.

    <p>A bar chart displaying the number of baseline records in each NLM MEDLINE baseline release from 2001 to 2012. (<a href="http://www.nlm.nih.gov/bsd/licensee/2012_stats/baseline_doc.html" target="_blank">http://www.nlm.nih.gov/bsd/licensee/2012_stats/baseline_doc.html</a>).</p>

    Network of Cancer-Gene associations from literature.

    <p>Network of Cancer/Gene associations displaying shared genes between cancers and genes specific to certain cancer types based on literature evidence. Cancer terms are represented as labeled nodes, genes are unlabeled pink nodes and the edges represent at least one publication with a co-occurrence of the cancer term and gene.</p>

    Architecture for integrating structured and unstructured data in Hadoop.

    <p>Architectural diagram detailing the steps in creating the categorical lexicons and using them to get the PMID counts from literature. DEG stands for Differentially Expressed Gene while DE miRNA stands for Differentially Expressed miRNA.</p>

    Bubble chart of Cancer-Gene associations from literature.

    <p>A bubble chart representation with cancer terms on the x-axis and genes on the y-axis. The size of the bubble is directly proportional to the number of literature articles where the cancer and gene terms co-occur.</p>
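The co-occurrence counts behind such a chart can be sketched as follows. This is a minimal in-memory illustration, not the Hadoop/MapReduce pipeline of the paper: the function name, the whitespace tokenization, and the restriction to single-word terms are all assumptions made for brevity.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(articles, cancer_terms, gene_terms):
    """Count, for each (cancer, gene) pair, the articles mentioning both.

    articles maps an article identifier to its text. Matching is a naive
    case-insensitive comparison of whitespace-separated tokens.
    """
    counts = Counter()
    for text in articles.values():
        words = set(text.lower().split())
        cancers = {c for c in cancer_terms if c.lower() in words}
        genes = {g for g in gene_terms if g.lower() in words}
        # Each article contributes at most one count per pair.
        for pair in product(cancers, genes):
            counts[pair] += 1
    return counts
```

Under these assumptions, the count for a pair would set the bubble size, and any pair with a count of at least one would correspond to an edge in a cancer-gene network.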

    Cancer term occurrences in the literature.

    <p>A bar chart representation with cancer terms on the y-axis and publication counts on the x-axis. Only the cancer terms with high literature occurrences are shown.</p>